This repository contains the codebase for our paper, "Rethinking Reward Models for Multi-Domain Test-Time Scaling."
conda create -n multi-rm python=3.10.14
conda activate multi-rm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
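If the installation succeeds, a quick sanity check (a minimal sketch; it only assumes a CUDA-capable GPU is visible) is to confirm that PyTorch sees the GPU and that the flash-attn build imports cleanly:
# optional sanity check: PyTorch version, GPU visibility, and the flash-attn build
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"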
# TASK_TYPE can be one of:
# gORM / gPRM
TASK_TYPE=[choose_one_above]
# generate data
python -m data_generation.generate_data \
--output_dir [OUTPUT_DIR] \
--task_type ${TASK_TYPE}
# preprocess data
python -m data_generation.preprocess_data \
--output_dir [OUTPUT_DIR] \
--task_type ${TASK_TYPE}
# shorten critique (optional)
python -m data_generation.shorten_critique \
--output_dir [OUTPUT_DIR] \
--task_type ${TASK_TYPE}
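Putting the three steps together, a full gPRM pass might look like this (a sketch; ./data/gPRM is an illustrative output directory, and the critique-shortening step stays optional):
# illustrative end-to-end data generation for gPRM
TASK_TYPE=gPRM
OUTPUT_DIR=./data/gPRM
python -m data_generation.generate_data --output_dir ${OUTPUT_DIR} --task_type ${TASK_TYPE}
python -m data_generation.preprocess_data --output_dir ${OUTPUT_DIR} --task_type ${TASK_TYPE}
python -m data_generation.shorten_critique --output_dir ${OUTPUT_DIR} --task_type ${TASK_TYPE}  # optional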
# Training dORM / dPRM
# Use the appropriate config file:
# ./configs/dORM-14B.yaml
# ./configs/dPRM-14B.yaml
# ./configs/dORM-8B.yaml
# ./configs/dPRM-8B.yaml
accelerate launch -m discriminative.train \
--config ./configs/dORM-14B.yaml \
--output_dir ./[TRAINING_RESULTS]/dORM-14B \
--per_device_batch_size 4 \
--category all
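The other discriminative configs use the same entry point; for example, an 8B dPRM run (a sketch; ./results is an illustrative output root) only swaps the config file and output directory:
# illustrative: train dPRM-8B
accelerate launch -m discriminative.train \
  --config ./configs/dPRM-8B.yaml \
  --output_dir ./results/dPRM-8B \
  --per_device_batch_size 4 \
  --category all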
# Training gORM / gPRM
# Use the appropriate config file:
# ./configs/gORM-14B.yaml
# ./configs/gPRM-14B.yaml
# ./configs/gORM-8B.yaml
# ./configs/gPRM-8B.yaml
accelerate launch -m generative.train \
--config ./configs/gORM-14B.yaml \
--output_dir ./[TRAINING_RESULTS]/gORM-14B \
--per_device_batch_size 4 \
--category all
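Accelerate's own flags can be passed at launch time as well; a gPRM-8B run on multiple GPUs might look like this (a sketch assuming 4 visible GPUs; --num_processes is a standard accelerate launch flag, and the output path is illustrative):
# illustrative: gPRM-8B on 4 GPUs
accelerate launch --num_processes 4 -m generative.train \
  --config ./configs/gPRM-8B.yaml \
  --output_dir ./results/gPRM-8B \
  --per_device_batch_size 4 \
  --category all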
# TEST can be one of:
# test (CoTs generated by Llama-3.1-8B-Instruct)
# test_smollm (CoTs generated by SmolLM3-3B)
# test_qwen (CoTs generated by Qwen2.5-7B-Instruct)
# test_gemma (CoTs generated by gemma-2-9b-it)
# test_llama (CoTs generated by Llama-3.1-70B-Instruct)
# Inference for dORM / dPRM
# Use the appropriate model checkpoint:
# dongboklee/dORM-14B
# dongboklee/dPRM-14B
# or use your own trained models
python -m discriminative.get_reward \
--data_path dongboklee/[TEST] \
--model_id dongboklee/dORM-14B \
--output_dir ./[REWARD_RESULTS]/dORM-14B-[TEST] \
--per_device_batch_size 8 \
--category all
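Process-level scoring follows the same pattern; for example, with the released dPRM-14B checkpoint on the default test split (a sketch; the output path is illustrative):
# illustrative: score the default test split with dPRM-14B
python -m discriminative.get_reward \
  --data_path dongboklee/test \
  --model_id dongboklee/dPRM-14B \
  --output_dir ./rewards/dPRM-14B-test \
  --per_device_batch_size 8 \
  --category all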
# Inference for gORM / gPRM
# Use the appropriate model checkpoint:
# dongboklee/gORM-14B-merged, TASK_TYPE=gORM
# dongboklee/gPRM-14B-merged, TASK_TYPE=gPRM
python -m generative.get_reward \
--data_path dongboklee/[TEST] \
--model_id dongboklee/gORM-14B-merged \
--output_dir ./[REWARD_RESULTS]/gORM-14B-[TEST] \
--task_type gORM \
--category all
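And analogously for the merged gPRM checkpoint (again a sketch; the output path is illustrative):
# illustrative: score the default test split with gPRM-14B-merged
python -m generative.get_reward \
  --data_path dongboklee/test \
  --model_id dongboklee/gPRM-14B-merged \
  --output_dir ./rewards/gPRM-14B-test \
  --task_type gPRM \
  --category all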
# Inference for gORM / gPRM (for your own trained models)
# Use the appropriate model checkpoint:
# [LOCAL_DIR]/gORM-14B, TASK_TYPE=gORM
# [LOCAL_DIR]/gPRM-14B, TASK_TYPE=gPRM
# Merge LoRA for vLLM inference
python -m generative.merge_lora \
--input_dir [LOCAL_DIR]/gORM-14B # saved to [LOCAL_DIR]/gORM-14B/tmp
python -m generative.get_reward \
--data_path dongboklee/[TEST] \
--model_id [LOCAL_DIR]/gORM-14B/tmp \
--output_dir ./[REWARD_RESULTS]/gORM-14B-[TEST] \
--task_type gORM \
--category all
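A locally trained gPRM goes through the same two commands (a sketch; [LOCAL_DIR] is the --output_dir used during training, and the reward output path is illustrative):
# illustrative: merge and score a locally trained gPRM-14B
python -m generative.merge_lora \
  --input_dir [LOCAL_DIR]/gPRM-14B
python -m generative.get_reward \
  --data_path dongboklee/test \
  --model_id [LOCAL_DIR]/gPRM-14B/tmp \
  --output_dir ./rewards/gPRM-14B-test \
  --task_type gPRM \
  --category all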
# TEST can be one of:
# test (CoTs generated by Llama-3.1-8B-Instruct)
# test_smollm (CoTs generated by SmolLM3-3B)
# test_qwen (CoTs generated by Qwen2.5-7B-Instruct)
# test_gemma (CoTs generated by gemma-2-9b-it)
# test_llama (CoTs generated by Llama-3.1-70B-Instruct)
TEST=[choose_one_above]
# Or use your own reward dirs instead of HF hubs:
# [MODEL_NAME]/[TEST]/[CATEGORY]_reward.json
python -m evaluation.evaluate \
--data_path dongboklee/${TEST} \
--output_dir [OUTPUT_DIR] \
--reward_dirs \
dongboklee/dORM-14B-${TEST} \
dongboklee/dPRM-14B-${TEST} \
dongboklee/gORM-14B-${TEST} \
dongboklee/gPRM-14B-${TEST} \
--model_names dORM-14B dPRM-14B gORM-14B gPRM-14B \
--strategies last min mean mean \
--num_runs 100
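To evaluate rewards you computed yourself, point --reward_dirs at the local directories written by get_reward instead of the hosted ones; for example, for a single locally scored dPRM-14B run (a sketch; paths are illustrative, and the aggregation strategy mirrors the dPRM column in the command above):
# illustrative: evaluate locally computed dPRM-14B rewards on test_qwen
TEST=test_qwen
python -m evaluation.evaluate \
  --data_path dongboklee/${TEST} \
  --output_dir ./eval_results \
  --reward_dirs ./rewards/dPRM-14B-${TEST} \
  --model_names dPRM-14B \
  --strategies min \
  --num_runs 100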
# CSV_FILE can be one of:
# [OUTPUT_DIR_FROM_ABOVE]/best_of_n.csv
# [OUTPUT_DIR_FROM_ABOVE]/weighted_vote.csv
CSV_FILE=[choose_one_above]
# [OUTPUT_FILE_PREFIX]=example
# -> example_legend.png / example_legend.pdf
# -> example.png / example.pdf
python -m evaluation.plot \
--input_file ${CSV_FILE} \
--output_file [OUTPUT_FILE_PREFIX]
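For instance, plotting the weighted-vote results from the evaluation step above (file names are illustrative):
# illustrative: produces wv.png / wv.pdf plus wv_legend.png / wv_legend.pdf
python -m evaluation.plot \
  --input_file ./eval_results/weighted_vote.csv \
  --output_file wv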
The assets of this repository are listed below, including the training and test datasets, model checkpoints, and the rewards produced by the four reward model variants.
Name | Description |
---|---|
train | multi-domain training dataset for dORM/dPRM (mostly adapted from VersaPRM). |
train_gORM | multi-domain training dataset for gORM generated by QwQ-32B. |
train_gPRM | multi-domain training dataset for gPRM generated by QwQ-32B. |
test | multi-domain test dataset with CoTs (N=128) generated by Llama-3.1-8B-Instruct (mostly adapted from VersaPRM). |
test_smollm | multi-domain test dataset with CoTs (N=16) generated by SmolLM3-3B. |
test_qwen | multi-domain test dataset with CoTs (N=16) generated by Qwen2.5-7B-Instruct. |
test_gemma | multi-domain test dataset with CoTs (N=16) generated by gemma-2-9b-it. |
test_llama | multi-domain test dataset with CoTs (N=16) generated by Llama-3.1-70B-Instruct. |
Name | Backbone | Trained On | LoRA-merged version |
---|---|---|---|
dORM-14B | 14B backbone | train | — |
dPRM-14B | 14B backbone | train | — |
gORM-14B | 14B backbone | train_gORM | gORM-14B-merged |
gPRM-14B | 14B backbone | train_gPRM | gPRM-14B-merged |
dORM-8B | 8B backbone | train | — |
dPRM-8B | 8B backbone | train | — |
gORM-8B | 8B backbone | train_gORM | gORM-8B-merged |
gPRM-8B | 8B backbone | train_gPRM | gPRM-8B-merged |
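Since the commands above read these assets as dongboklee/<name> repositories on the Hugging Face Hub, they can also be fetched ahead of time with the standard CLI (a sketch; pick whichever dataset or checkpoint you need):
# illustrative: pre-download one dataset and one checkpoint
huggingface-cli download dongboklee/test --repo-type dataset
huggingface-cli download dongboklee/dPRM-14B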
@article{multi-rm,
  title   = {Rethinking Reward Models for Multi-Domain Test-Time Scaling},
  author  = {Lee, Dong Bok and Lee, Seanie and Park, Sangwoo and Kang, Minki and Baek, Jinheon and Kim, Dongki and Wagner, Dominik and Jin, Jiongdao and Lee, Heejun and Bocklet, Tobias and Wang, Jinyu and Fu, Jingjing and Hwang, Sung Ju and Bian, Jiang and Song, Lei},
  journal = {arXiv preprint arXiv:2510.00492},
  year    = {2025}
}